{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pandas 数据预处理\n",
    "## 一.合并数据\n",
    "### 1.使用concat函数堆叠合并数据\n",
    "### 2.使用merge函数通过主键合并数据；\n",
    "\n",
    "## 二.数据清洗\n",
    "### 1.检测并处理缺失值\n",
    "### 2.检测并处理重复值\n",
    "### 3.检测并处理异常值\n",
    "## 知识目标：掌握合并数据和清洗数据的常用方法\n",
    "## 技能目标：掌握检测并做好数据预处理\n",
    "## 重点：concat数据合并与缺失值、重复值、异常值的检测\n",
    "## 难点：缺失值、重复值、异常值的处理\n",
    "========================================================="
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 一.合并数据\n",
    "### 1.使用concat函数堆叠合并数据\n",
    "堆叠也称为轴向连接，依照连接轴的方向，数据堆叠可分为横向堆叠（x轴向）和纵向堆叠（y轴向）。（详情示意图见word文档）\n",
    "\n",
    "concat()函数可以沿着一条轴将多个对象进行堆叠，其使用方式类似数据库中的数据表合并。\n",
    "\n",
    "concat()语法：pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None,copy=True)\n",
    "\n",
    "#axis：表示连接的轴向，可以为0或1，默认为0。\n",
    "#join：表示连接的方式，inner表示内连接，outer表示外连接，默认使用外连接。\n",
    "#ignore_index：如果设置为True，清除现有索引并重置索引值。\n",
    "#names：结果分层索引中的层级的名称。\n",
    "#verify_integrity：检查新的连接轴是否包含重复项，接收布尔值，默认为Falae。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import pymysql\n",
    "from sqlalchemy import create_engine\n",
    "engine = create_engine('mysql+pymysql://root:123456@127.0.0.1:3306/dsj?charset=utf8')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例1：演示data1与data2 横向堆叠、纵向堆叠\n",
    "data1 = pd.DataFrame({'A':['A1','A2','A3','A4'],\n",
    "                   'B':['B1','B2','B3','B4'],\n",
    "                   'C':['C1','C2','C3','C4'],\n",
    "                   'D':['D1','D2','D3','D4']},index=[1,2,3,4])\n",
    "data2 = pd.DataFrame({ 'B':['B2','B4','B6','B8'],\n",
    "                    'D':['D2','D4','D6','D8'],\n",
    "                    'F':['F2','F4','F6','F8']},index=[2,4,6,8])\n",
    "pd.concat([data1,data2],axis=1,sort=False)    #如果两个对象的形状不一样，合并后会产生不存在的数据，这些数据用NaN填充\n",
    "pd.concat([data1,data2],axis=0,sort=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例2：将多张菜品订单详情表纵向合并\n",
    "detail1 = pd.read_sql('meal_order_detail1',engine)\n",
    "detail2 = pd.read_sql('meal_order_detail2',engine)\n",
    "detail3 = pd.read_sql('meal_order_detail3',engine)\n",
    "detail = pd.concat([detail1,detail2,detail3])\n",
    "detail"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*纵向堆叠合并的另一种方式\n",
    "#append()方法实现纵向堆叠有一个前提条件，就是两张表的列名完全一致。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例3：用append()实现例2\n",
    "detail_new = detail1.append([detail2,detail3])\n",
    "detail_new"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.使用merge函数通过主键合并数据\n",
    "主键合并类似于关系型数据库的连接方式，它是指根据一个或多个键将不同的DataFrame对象连接起来，大多数是将两个DataFrame对象中重叠的列作为合并的键，并采用内连接的方式合并数据，即取行索引重叠的部分(示意图见word文档)。\n",
    "\n",
    "①用merge()函数实现主键合并：\n",
    "\n",
    "merge()语法：pd.merge(left, right, how='inner', on=None, left_on=None,right_on=None, left_index=False, right_index=False, sort=False,suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)\n",
    "#left：参与合并的左侧DataFrame对象。\n",
    "#right：参与合并的右侧DataFrame对象。\n",
    "#how：表示连接方式，默认为inner。\n",
    "#on：用于连接的列名，必须存在与左右两个DataFrame对象中。\n",
    "#left_on：以左侧DataFrame作为连接键。\n",
    "#right_on：以右侧DataFrame作为连接键。\n",
    "#left_index：左侧的行索引用作连接键。\n",
    "#right_index：右侧的行索引用作连接键。\n",
    "#suffixes：用于追加到重叠列名的末尾。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例4：连接“超市营业额2”文件中sheet1、sheet2和sheet3的数据\n",
    "df1 = pd.read_excel(r'D:/pyData/超市营业额2.xlsx')\n",
    "df2 = pd.read_excel(r'D:/pyData/超市营业额2.xlsx',sheet_name=1)\n",
    "df3 = pd.read_excel(r'D:/pyData/超市营业额2.xlsx',sheet_name=2)\n",
    "#sheet1与sheet2使用纵向堆叠合并\n",
    "df = pd.concat([df1,df2])\n",
    "#df与sheet3使用主键合并\n",
    "df_new = pd.merge(df,df3,on='工号')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例5：使用4种不同的方式连接“学生信息”表和“选修记录”表，查看结果不同之处\n",
    "data_inner = pd.merge(student,scores,how='inner')\n",
    "data_outer = pd.merge(student,scores,how='outer')\n",
    "data_left = pd.merge(student,scores,how='left')\n",
    "data_right = pd.merge(student,scores,how='right')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "思考1：如果有多张表如何用连接？使用外键？\n",
    "\n",
    "使用主键合并订单详情表(detail)、订单信息表(meal_order_info.csv)、客户信息(users.xlsx)\n",
    "#detail由detail1，detail2，detail3合并组成，上面已完成。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "detail1 = pd.read_sql('meal_order_detail1',engine)\n",
    "detail2 = pd.read_sql('meal_order_detail2',engine)\n",
    "detail3 = pd.read_sql('meal_order_detail3',engine)\n",
    "order = pd.read_csv('D:/pyData/meal_order_info.csv',encoding='gbk')\n",
    "user = pd.read_excel('D:/pyData/users.xlsx')\n",
    "detail = pd.concat([detail1,detail2,detail3])\n",
    "\n",
    "#detail['order_id'] = detail['order_id'].astype(int)\n",
    "#detail['emp_id'] = detail['emp_id'].astype(int)\n",
    "#order['info_id'] = order['info_id'].astype(int)\n",
    "#order['emp_id'] = order['emp_id'].astype(int)\n",
    "#user['USER_ID'] = user['USER_ID'].astype(int)\n",
    "\n",
    "#通过观察可以发现：\n",
    "#detail与order存在关联，它们的主键可以是'emp_id'，另外，detail中的'order_id'与order中'的info_id'相同，其中一个可以当外键\n",
    "#order与user存在关联，order中的'emp_id'和user中的‘USER_ID’相同，因此其一也可以当外键\n",
    "data = pd.merge(detail,order,left_on=['order_id','emp_id'],right_on=['info_id','emp_id'])\n",
    "data1 = pd.merge(data,user,left_on='emp_id',right_on='USER_ID',how='inner')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "②join方法\n",
    "\n",
    "主键合并数据，除了用merge()函数以外，join方法也可以实现部分主键合并的功能，但是join方法使用时，两个主键的名字必须相同。\n",
    "\n",
    "join语法：\n",
    "pandas.DataFrame.join(self, other, on=None, how='left', lsuffix='', rsuffix='', sort=False)\n",
    "#lsuffix：接收string。表示用于追加到左侧重叠列名的末尾。无默认。\n",
    "#rsuffix：接收string。表示用于追加到右侧重叠列名的末尾。无默认。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "例6：通过主键连接meal_order_detail1与meal_order_info.csv两表\n",
    "#meal_order_detail1的主键为order_id\n",
    "#meal_order_info.csv的主键为info_id\n",
    "#两个主键必须同名同类型，才能使用join\n",
    "detail1['order_id'] = detail1['order_id'].astype(int)   #强制转换类型\n",
    "order.rename({'info_id' : 'order_id'},inplace=True)    #重命名，使两个主键同名。inplace=True表示原地替换\n",
    "order_detail1 = detail.join(order,on='order_id',rsuffix='1') #rsuffix='1'表示，如果有重复的列名，给右边的列名追加一个后缀1\n",
    "order_detail1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 二.数据清洗\n",
    "数据清洗的目的在于提高数据质量，将脏数据清洗干净，使原数据具有完整性、唯一性、权威性、合法性、一致性等特点。\n",
    "  \n",
    "#缺失值——异常值——重复值\n",
    "### 1.检测并处理缺失值\n",
    "一般使用None表示缺失字符串，使用 np.nan/NaN表示缺失数值，使用pd.nat/NaT表示缺失时间。\n",
    "\n",
    "#### ①检测缺失值\n",
    "使用isnull()或notnull()方法判断数据集中是否存在缺失值，返回的是布尔值。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例7：判断''是否为缺失值\n",
    "A = pd.DataFrame({\"name\": ['Alfred', 'Batman', 'Catwoman'],\n",
    "                    \"toy\": [np.nan, 'Batmobile', 'Bullwhip'],\n",
    "                   \"born\": [pd.NaT, pd.Timestamp(\"1940-04-25\"),pd.NaT]})\n",
    "A.isnull()\n",
    "#试试将其中一个pd.NaT改为''或者None，将'Bullwhip'改为''\n",
    "B = pd.DataFrame({\"name\": ['Alfred', 'Batman', 'Catwoman'],\n",
    "                    \"toy\": [np.nan, 'Batmobile', ''],\n",
    "                   \"born\": ['', pd.Timestamp(\"1940-04-25\"),pd.NaT]})\n",
    "B.isnull()\n",
    "#结论 字符串中''不是缺失值！但是，如果将时间类别的某一个值改为None或者''，都会转换为缺失值NaT"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### ②删除缺失值\n",
    "\n",
    "使用dropna()方法删除含有缺失值的行或者列。\n",
    "\n",
    "语法：\n",
    "dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)\n",
    "#axis：确定过滤行或列。 默认axis=0删除的是行，即删除记录；当axis=1删除的是列，即删除特征。\n",
    "#how：确定过滤的标准。any表示某行或某列只要有任意一个缺失值就删除；all表示某行或某列全部是缺失值才执行删除操作。\n",
    "#thresh：表示有效数据量的最小要求。若传入了2，则是要求该行或该列至少有两个非NaN值时将其保留。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例8：检测全部餐饮订单数据是否有缺失值，并删除数据中全部值均为缺失值的列\n",
    "detail.dropna(axis=1,how='all')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### ③填充缺失值\n",
    "Pandas中使用fillna()方法可以实现填充缺失值。缺失值所在特征为数值型时，通常利用其均值、中位数和众数等描述其中趋势的统计量来替代缺失值；缺失值所在的特征为类别型时，则选择使用众数来替代缺失值。\n",
    "\n",
    "语法：fillna(value=None, method=None, axis=None, inplace=False,limit=None, downcast=None, **kwargs)\n",
    "#value：表示用来替换缺失值的值。无默认。可以接收dict、Series或者DataFrame\n",
    "#method：表示填充方式，默认值为None。取值为backfill或bfill时，表示使用下一个非缺失值来填补；取值为pad或ffill时，表示使用上一个非缺失值来填补。\n",
    "#limit： 可以连续填充的最大数量，超过则不进行填补，默认None。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#例9：检测“超市营业额2”中“交易额”数据是否有缺失值，并找出缺失的行,将“交易额”整体均值填充给缺失值。\n",
    "from copy import deepcopy\n",
    "dff = deepcopy(df1)    #深复制，不影响原来的df1\n",
    "dff[dff['交易额'].isnull()]\n",
    "dff.fillna({'交易额':dff['交易额'].mean()},inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "思考2：使用每人交易额均值替换缺失值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dff1 = deepcopy(df1)\n",
    "for i in dff1[dff1['交易额'].isnull()].index:\n",
    "    dff1.loc[i,'交易额'] = dff1.loc[dff1['姓名']==dff1.loc[i,'姓名'],'交易额'].mean()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
